Multi-Layer Discourse Annotation of a Dutch Text Corpus

نویسندگان

  • Gisela Redeker
  • Ildikó Berzlánovich
  • Nynke van der Vliet
  • Gosse Bouma
  • Markus Egg
چکیده

We have compiled a corpus of 80 Dutch texts from expository and persuasive genres, which we annotated for rhetorical and genre-specific discourse structure, and lexical cohesion with the goal of creating a gold standard for further research. The annotations are based on a segmentation of the text in elementary discourse units that takes into account cues from syntax and punctuation. During the labor-intensive discourse-structure annotation (RST analysis), we took great care to thoroughly reconcile the initial analyses. That process and the availability of two independent initial analyses for each text allows us to analyze our disagreements and to assess the confusability of RST relations, and thereby improve the annotation guidelines and gather evidence for the classification of these relations into larger groups. We are using this resource for corpus-based studies of discourse relations, discourse markers, cohesion, and genre differences, e.g., the question of how discourse structure and lexical cohesion interact for different genres in the overall organization of texts. We are also exploring automatic text segmentation and semi-automatic discourse annotation.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Building a Discourse-Annotated Dutch Text Corpus

We are compiling a corpus of Dutch texts annotated with discourse structure and lexical cohesion, containing initially 80 texts from expository and persuasive genres. We are using this resource for corpus-based studies of discourse relations, discourse markers, cohesion, and genre differences. We are also exploring the possibilities of automatic text segmentation and semi-automatic discourse an...

متن کامل

KNACK-2002: a Richly Annotated Corpus of Dutch Written Text

In this paper, we introduce the annotated KNACK-2002 corpus of Dutch written text. The corpus features five different annotation layers, ranging from the annotation of morphological boundaries at the word level, over the annotation of part-of-speech tags and phrase chunks at the syntactic level to the annotation of named entities at the semantic level and coreferential relations at the discours...

متن کامل

Starting a Sentence in L2 German - Discourse Annotation of a Learner Corpus

Learner corpora consist of texts produced by second language (L2) learners. I We present ALeS Ko, a learner corpus of Chinese L2 learners of German and discuss the multi-layer annotation of the left sentence periphery notably the Vorfeld.

متن کامل

Using Large Multi-purpose Corpora for Specific Research Questions: Discourse Phenomena Related to Wh-questions in the Spoken Dutch Corpus

In this paper, we investigate whether a dataset derived from a multi-purpose corpus such as the Spoken Dutch Corpus may be considered appropriate for developing a taxonomy of wh-questions, and a model of the way in which these questions are integrated in spoken discourse. We compare the results obtained from the Spoken Dutch Corpus with a similar analysis of a large random collection of FAQs fr...

متن کامل

Computational Analysis of Coherence Relations in Dutch

The NWO-programme Modelling textual organisation: coherence and cohesion studies the organisation of text into structural units by means of coherence (discourse relations between clausal and larger textual units) and cohesion (lexico-semantic relations between words in textual units). The programme is organised around two related PhD-projects, focussing on coherence and cohesion, respectively. ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012